The dataset we used in this assignment here is downloaded from GEO with id GSE179448. The dataset includes RNAseq analysis of human CD4+ regulatory Tregs and Tconvs in COVID-19 and healthy donors isolated from peripheral blood. The dataset was performed on using GPL18573 Illumina NextSeq 500 platform.
The dataset we used is linked to the published article: “Profound Treg pertubations correlate with COVID-19 severity”[1]. Regulatory T cells (Tregs) are a subset of T cells that play an important role in maintaining immune homeostasis and preventing autoimmunity. Patients with COVID-19 have significant perturbations in Treg populations, which can have important implications for disease pathogenesis and clinical outcomes. This article performed a RNASeq on differnt donor samples at different health stages in regulatory T cells and conventional T cells.
There are in total of two supplementary files provided in the dataset
called “GSE179448_Gene_count_table.tsv.gz” and
“GSE179448_Metadata_GalvanLeon_COVID19_Treg_ULIRNAseq.xlsx”.
The
first one contains count data from RNASeq, and the second excel file
contains information of the genral information on each sample. The first
file is used to extract all count data. The second excel file is used
for creating labels and defining groups. Both files are used to group
defining and data cleaning processes.
There is a total of 86 donor samples in the dataset, where 43 of them are Regulatory T cell samples from donors, and the other 43 samples are conventional T cell samples from donors. There are four groups of donors included: healthy donors(H4), mild outpatients(M3), recovered (R2) and hospitalized to severe(S1). There are 56120 genes in the unprocessed dataset. After filtering out the low count data, I have removed 39466 outliers(count per million(cpm) less than 1). There is 16654 genes left after low count filtering.
Comparing the boxplot before and after normalization, it is evident that the log2CPM and median of log2CPM remain relatively stable. However, the median of each sample or replicate becoming more closely aligned.The distribution of the density plot is not perfect as in there is a smaller peak before the main peak. This could be caused by mixing two cell types together in the normalization process.
The dispersion of the dataset shows variations in the trend line. The variation seen here can come from either biological or technical source in a RNASeq experiment.
There are a total of 2154 expression data not mapped.
After this data cleaning process, we have 14500 normalized gene expression left excluding the ones are not being mapped 2154 expression. They make up the 16654 genes after removing low counts. In this assignment, we will continue to perform differntial gene analysis.
Install packages if required
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
if (!requireNamespace("knitr", quietly = TRUE)){
install.packages("knitr")}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)){
install.packages("ComplexHeatmap")}
if (!requireNamespace("circlize", quietly = TRUE)){
install.packages("circlize")}
if (!requireNamespace("magick", quietly = TRUE)){
install.packages("magick")}
if (!requireNamespace("RColorBrewer", quietly = TRUE)){
install.packages("RColorBrewer")}
if (!requireNamespace("edgeR", quietly = TRUE)){
BiocManager::install("edgeR")}
if (!requireNamespace("Biobase", quietly = TRUE)){
BiocManager::install("Biobase")}
if (!requireNamespace("kableExtra", quietly = TRUE)){
install.packages("kableExtra")}
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
if (!requireNamespace("dplyr", quietly = TRUE)){
install.packages("dplyr")}
[2] [3] [4] [5] [6] [7] [8] [9] [10]
Load Packages
library("knitr")
library("ComplexHeatmap")
library("circlize")
library("magick")
library("RColorBrewer")
library("edgeR")
library("Biobase")
library("kableExtra")
library("dplyr")
# Retrieve information from the normalized data file generated from Assignment 1 by reading the text file "tcell_normalized_count_2023.txt".
# Use file.path() function to make sure it will compile in any system.
# The file includes header and are seperated by space.
T_norm <- read.table(file = file.path(getwd(), "data",
"tcell_normalized_count_2023.txt"),
header = TRUE,
sep = " ",
check.names = FALSE)
# View the first five rows and columns of the data retrieved.
knitr::kable(T_norm[1:5, 1:5], type = "html")
| pt1.H4.TR | pt4.H4.TR | pt6.H4.TR | pt7.H4.TR | pt8.H4.TR | |
|---|---|---|---|---|---|
| A1BG | 8.1753000 | 3.828280 | 1.7599736 | 2.228690 | 2.346909 |
| A1BG-AS1 | 2.5154769 | 1.531312 | 4.2742217 | 3.900208 | 4.302666 |
| A2M | 20.4382501 | 59.338344 | 11.8169659 | 23.401247 | 36.768237 |
| A2M-AS1 | 14.1495578 | 21.438370 | 3.7713721 | 14.486486 | 16.819513 |
| A2ML1 | 0.6288692 | 1.148484 | 0.7542744 | 4.457380 | 1.955757 |
From the table above, the column names are sample labels, and the row names are HGNC gene symbol. The sample label is in the format ptx.yy.zz where ptx represents patient x where x is a number. Each number represents one patient. yy represents the confition the patient is in, it includes four different levels: Level 1 Hospitalized to severe(S1), Level 2 Mild Outpatients(M2), Level 3 Recovered(R3), Level 4 Healthy(H4). The normalized expression data are listed in the columns under the sample label. Each row represent a single gene.
After loading our data, let us take an overall look at the data and see what the data looks like.
hm <- T_norm[,1:ncol(T_norm)]
rownames(hm) <- rownames(T_norm)
colnames(hm) <- colnames(T_norm[,1:ncol(T_norm)])
knitr::kable(hm[1:5, 1:5], type = "html")
| pt1.H4.TR | pt4.H4.TR | pt6.H4.TR | pt7.H4.TR | pt8.H4.TR | |
|---|---|---|---|---|---|
| A1BG | 8.1753000 | 3.828280 | 1.7599736 | 2.228690 | 2.346909 |
| A1BG-AS1 | 2.5154769 | 1.531312 | 4.2742217 | 3.900208 | 4.302666 |
| A2M | 20.4382501 | 59.338344 | 11.8169659 | 23.401247 | 36.768237 |
| A2M-AS1 | 14.1495578 | 21.438370 | 3.7713721 | 14.486486 | 16.819513 |
| A2ML1 | 0.6288692 | 1.148484 | 0.7542744 | 4.457380 | 1.955757 |
if(min(hm) == 0){
# Check values in hm(heatmap matrix generated from my normalized data) contain
# negatives. If the matrix only contain positive values, two color scale will
# be used.
hm_col = circlize::colorRamp2(
breaks = c( 0, max(hm)),
colors = c( "white", "red"))
} else {
# If there are both negative and positive values, three color scale will be used.
# Blue represents negative values, white represents 0 and red represents positive values.
hm_col = circlize::colorRamp2(
breaks = c(min(hm), 0, max(hm)),
colors = c("blue", "white", "red"))
}
# use ComplexHeatmap package to create Heatmap
grDevices::png(filename = file.path(getwd(), "figures",
"t_heatmap_before_row_normalization.png"),
width = 24,
height = 16,
units = "cm",
res = 600)
tcell_heatmap <- ComplexHeatmap::Heatmap(
# Set parameters for tcell_heatmap
matrix = as.matrix(hm),
# set columns
col = hm_col,
# row dendgram shows the gene cluster
show_row_dend = TRUE,
# column dendogram shows the samples cluster.
show_column_dend = TRUE,
show_column_names = TRUE,
column_title = "Normalized Expression of Treg and Tconv cells before Row Normalization",
column_title_gp = gpar(fontsize = 14),
column_names_max_height = unit(8, "cm"),
column_names_gp = gpar(fontsize = 6),
column_names_rot = 90,
show_row_names = FALSE,
# Set legend parameters
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "Expression"),
# Set raster image parameters
use_raster = FALSE,
raster_device = "png",
raster_quality = 1,
raster_resize_mat = FALSE,
raster_magick_filter = NULL
)
draw(tcell_heatmap)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures",
"t_heatmap_before_row_normalization.png"))
Figure 1: Normalized Expression of Treg and Tconv Cells before Row Normalization. The dendrogram on the top and left side represents clusters. The legend represents level of expression. Highest expression in red and lower expression in white.
As the heatmap shown, there is absolutely nothing shown in the heatmap. This could be a result of the high expression of a single gene that has the expression of 40000, and that eliminates all other signals in the heatmap. Therefore, we can perform a row normalization of the heatmap to solve this.
Note: Considering the width of the html doc, I am not able to
make the font any larger to still clearly show the labels with no
overlapping. I also tried the annotation function in ComplexHeatmap, but
I was not able make it label by condition and cell type.
# Scale the matrix by row normalization of each gene
hm_rnorm <- t(scale(t(hm)))
if(min(hm_rnorm) == 0){
# Check values in hm(heatmap matrix generated from my normalized data) contain
# negatives. If the matrix only contain positive values, two color scale will
# be used.
hm_rnorm_col = circlize::colorRamp2(
breaks = c( 0, max(hm_rnorm)),
colors = c( "white", "red"))
} else {
# If there are both negative and positive values, three color scale will be used.
# Blue represents negative values, white represents 0 and red represents positive values.
hm_rnorm_col = circlize::colorRamp2(
breaks = c(min(hm_rnorm), 0, max(hm_rnorm)),
colors = c("blue", "white", "red"))
}
# use ComplexHeatmap package to create Heatmap
grDevices::png(filename = file.path(getwd(), "figures",
"t_heatmap_after_row_normalization.png"),
width = 24,
height = 16,
units = "cm",
res = 600)
tcell_heatmap_rnorm <- ComplexHeatmap::Heatmap(
# Set parameters for tcell_heatmap
matrix = as.matrix(hm_rnorm),
# set columns
col = hm_rnorm_col,
# row dendgram shows the gene cluster
show_row_dend = TRUE,
# column dendogram shows the samples cluster.
show_column_dend = TRUE,
show_column_names = TRUE,
column_title = "Normalized Expression of Treg and Tconv cells After Row Normalization",
column_title_gp = gpar(fontsize = 14),
column_names_max_height = unit(8, "cm"),
column_names_gp = gpar(fontsize = 6),
column_names_rot = 90,
show_row_names = FALSE,
# Set legend parameters
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "Expression"),
# Set raster image parameters
use_raster = TRUE,
raster_device = "png",
raster_quality = 1,
raster_resize_mat = TRUE,
raster_by_magick = requireNamespace("magick", quietly = TRUE),
raster_magick_filter = NULL
)
draw(tcell_heatmap_rnorm)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures",
"t_heatmap_after_row_normalization.png"))
Figure 2: Normalized Expression of Treg and Tconv Cells after Row Normalization. The dendrogram on the top and left side represents clusters. The legend represents level of expression. Highest expression in red and lower expression in white.
When we scale the heatmap matrix, we can actually see the
differential expression from the color by looking at the heatmap. From
the heatmap, we can see that the samples does not cluster by patient
sample, but cluster by cell type. If we take a closer look at each
label, we can see that sample also cluster by donor condition. S4 level
samples have clustered expression.
Note: Considering the width of
the html doc, I am not able to make the font any larger to still clearly
show the labels with no overlapping. I also tried the annotation
function in ComplexHeatmap, but I was not able make it label by
condition and cell type.
Then, we will use another way to visualize the data using MDS plot. First, we want to see how the samples are distributing when we color different cell types.
grDevices::png(filename = file.path(getwd(), "figures",
"MDSplot_cell_type.png"),
width = 24,
height = 16,
units = "cm",
res = 600)
limma::plotMDS(hm_rnorm,
col = rep(c("#33A02C","#1F78B4"),c(45,41)), pch = 19,
main = "MDS plot for Treg/Tconv RNASeq Samples by Cell Type")
legend("bottom", legend = c("Treg Cells", "Tconv Cells"),
c = c("#33A02C","#1F78B4"), cex=0.7, inset = c(0, 0),
pch = 19, ncol = 2)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures", "MDSplot_cell_type.png"))
Figure 3: MDS plot by cell type. TR represents regulatory T cells, and TC represents conventional T cells.}
From the MDS plot above, it is clear that the donor samples clusters based on Treg cells and Tconv cells. But how much does donor samples affect the variability of the data? If we look at individual text labels on the second image, we do not see the sample data cluster on patient labels.
grDevices::png(filename = file.path(getwd(), "figures",
"MDSplot_cell_type_text.png"),
width = 30,
height = 24,
units = "cm",
res = 600)
limma::plotMDS(hm_rnorm,
col = rep(c("#33A02C","red", "#33A02C", "#1F78B4", "red", "#1F78B4"),c(6, 1, 38, 7, 1, 33)),
main = "MDS plot for Treg/Tconv RNASeq Samples by Cell Type",
plot = TRUE,
cex = 0.5
)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures", "MDSplot_cell_type_text.png"))
Figure 4: MDS plot using text as labels. Green represents regulatory T cells, and blue represents concentional T cells.
As seen in this figure, the patient does not seem to affect the
distribution of samples. If we focus on patient 11(pt11.H4.TR and
pt11.H4.TC marked in red), they are located in their cluster. We also
have another factor in this sample which is the condition of patients.
Therefore, we will draw one more MDS plot to see if the samples will
cluster based on conditions.
Note: Considering the width of the
html doc, I am not able to make the font any larger to still clearly
show the labels with no overlapping. I have generated another image in
the figures subdirectory in my github repo, you can access this image
from the repo if needed.
# My samples are formatted as ptx.yy.zz, so each group can be seperated by "."
sample_labels <- data.frame(lapply(colnames(hm_rnorm),
FUN=function(x){unlist(strsplit(x,
split = "\\."))}))
# Define groups
groups <- unique(paste(sample_labels[2,], sample_labels[3,], sep = "."))
hm_rnorm_df <- data.frame(hm_rnorm)
# Create separate dataframes to label each group.
healthy_samples_c <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\H4.TC")]
healthy_samples_r <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\H4.TR")]
mild_samples_c <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\M2.TC")]
mild_samples_r <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\M2.TR")]
recovered_samples_c <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\R3.TC")]
recovered_samples_r <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\R3.TR")]
severe_samples_c <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\S1.TC")]
severe_samples_r <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\S1.TR")]
sorted_hm_rnorm_df <- cbind(healthy_samples_c, healthy_samples_r, mild_samples_c, mild_samples_r, recovered_samples_c, recovered_samples_r, severe_samples_c, severe_samples_r)
hm_rnorm_groups <- sorted_hm_rnorm_df[,1:ncol(sorted_hm_rnorm_df)]
rownames(hm_rnorm_groups) <- rownames(sorted_hm_rnorm_df)
colnames(hm_rnorm_groups) <- colnames(sorted_hm_rnorm_df[,1:ncol(sorted_hm_rnorm_df)])
index_groups <- c(ncol(healthy_samples_c), ncol(healthy_samples_r),
ncol(mild_samples_c), ncol(mild_samples_r),
ncol(recovered_samples_c), ncol(recovered_samples_r),
ncol(severe_samples_c), ncol(severe_samples_r))
# Create device for image
grDevices::png(filename = file.path(getwd(), "figures",
"MDSplot_group.png"),
width = 24,
height = 16,
units = "cm",
res = 600)
#plot graph
limma::plotMDS(hm_rnorm_groups,
col = rep(brewer.pal(8,"Paired"),
index_groups),
main = "MDS plot for Treg/Tconv RNASeq Samples",
pch = 19)
#create legend
legend("top", legend = sort(groups),
c=brewer.pal(8,"Paired"), cex=0.7, inset = c(0, 0),
pch = 19, ncol = 4)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures", "MDSplot_group.png"))
Figure 5: MDS plot using text as labels. Green represents regulatory T cells, and blue represents concentional T cells.
From the above plot, we can identify that cell type of donor sample condition could be important for the expression value.
We can also have hypothesized that the condition(healthy, recovered, mild outpatients, hospitalized to severe). Therefore, we might need to include both conditions in our analysis.
# Create groups datagrame for heatmap
colnames(sample_labels) <- colnames(hm_rnorm)
rownames(sample_labels) <- c("donors", "condition", "cell_type")
sample_labels <- data.frame(t(sample_labels))
knitr::kable(sample_labels[1:5,])
| donors | condition | cell_type | |
|---|---|---|---|
| pt1.H4.TR | pt1 | H4 | TR |
| pt4.H4.TR | pt4 | H4 | TR |
| pt6.H4.TR | pt6 | H4 | TR |
| pt7.H4.TR | pt7 | H4 | TR |
| pt8.H4.TR | pt8 | H4 | TR |
The table above have three columns, donors, condition and cell type. The original sample labels are the row names. We will first start with a simple model that only accounts for the cell type(refered to as the simple model in the text below).
# Build a very simple model that only consider cell type using limma
simple_model <- stats::model.matrix(~ sample_labels$cell_type)
knitr::kable(simple_model[1:5,], type="html") %>% kableExtra::row_spec(0, angle = -3)
| (Intercept) | sample_labels$cell_typeTR |
|---|---|
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
The table above showed the simple model matrix where the first column shows intercept, and the second column shows if the group specified by the column name is true for each sample. 0 means False, and 1 represents True. In our sample, there will be 43 1s(representing the first 43 samples are Treg cells) followed by 43 0s(representing the last 43 groups are Tcoonv cells).
# Create a matrix from the dataframe retrieved in step1 named T_norm
norm_t_exp <- as.matrix(T_norm)
rownames(norm_t_exp) <- rownames(T_norm)
colnames(norm_t_exp) <- colnames(T_norm)
# Define the expression matrix as the assayData
norm_t_set <- Biobase::ExpressionSet(assayData = norm_t_exp)
# Fit the expression set to the model
norm_t_set_fit <- limma::lmFit(object = norm_t_set,
design = simple_model)
# Apply empirical Bayes to compute differential expression
# Set trend = TRUE for RNAseq data
norm_t_set_fit_B <- limma::eBayes(norm_t_set_fit, trend = TRUE)
# Extract top hits
# Use Benjamini-hochberg("BH") method for correction
top_norm_t_set_fit_B <- limma::topTable(fit = norm_t_set_fit_B,
coef = ncol(simple_model),
adjust.method = "BH",
number = nrow(norm_t_exp))
hits <- top_norm_t_set_fit_B[order(top_norm_t_set_fit_B$P.Value),]
knitr::kable(hits[1:10,],type="html",row.names = TRUE, digits = 50)
| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| IKZF2 | 114.84913 | 73.65007 | 27.35733 | 1.418150e-44 | 1.998740e-40 | 54.75425 |
| RTKN2 | 133.08309 | 79.60551 | 25.07200 | 1.222916e-41 | 8.617889e-38 | 52.31201 |
| DENND5A | -91.13979 | 73.55567 | -20.93847 | 8.867301e-36 | 4.165858e-32 | 46.69063 |
| IL2RA | 346.97956 | 204.36533 | 20.25869 | 9.709680e-35 | 3.421206e-31 | 45.58663 |
| THEMIS | -60.39143 | 42.51102 | -19.81293 | 4.800469e-34 | 1.353156e-30 | 44.83121 |
| LINC00426 | 57.40184 | 47.43085 | 18.70444 | 2.826963e-32 | 6.345906e-29 | 42.83860 |
| STAT4 | -111.85723 | 139.65514 | -18.67540 | 3.151791e-32 | 6.345906e-29 | 42.78412 |
| TIGIT | 237.72109 | 165.27967 | 18.37986 | 9.588928e-32 | 1.689329e-28 | 42.22293 |
| FOXP3 | 222.96894 | 120.23002 | 18.24225 | 1.615629e-31 | 2.530075e-28 | 41.95735 |
| TMEM71 | -43.29108 | 39.95185 | -17.90230 | 5.921496e-31 | 8.345757e-28 | 41.28939 |
The above table shows the top hits after fitting the data into the simple model that only consider cell type sorted by P value in an increasing order.
length(which(hits$P.Value < 0.05))
## [1] 7212
length(which(hits$adj.P.Val < 0.05))
## [1] 6411
length(which(hits$P.Value < 0.01))
## [1] 5691
length(which(hits$adj.P.Val < 0.01))
## [1] 4981
We here used the simple model which only consider the cell type as a contributing factor. When we use the threshold of 0.05, there are 7212 genes that are found to be differntially expressed, and 6411 genes passed correction. When we use the threshold of 0.01, there are 5691 genes that are found to be differntially expressed, and 4981 genes passed correction.
As preciously discussed, we hypothesized that donor condition can also be a contributing factor. Therefore, we will construct a more complex model that accounts for cell type and condition as shown below. ### Complex model
# A more complex model, add condition of donor as another factor
complex_model <- model.matrix(~ sample_labels$condition + sample_labels$cell_type)
knitr::kable(complex_model[1:5,1:5],type="html") %>% kableExtra::row_spec(0, angle = -3)
| (Intercept) | sample_labels\(conditionM2 </th> <th style="text-align:right;-webkit-transform: rotate(-3deg); -moz-transform: rotate(-3deg); -ms-transform: rotate(-3deg); -o-transform: rotate(-3deg); transform: rotate(-3deg);"> sample_labels\)conditionR3 | sample_labels\(conditionS1 </th> <th style="text-align:right;-webkit-transform: rotate(-3deg); -moz-transform: rotate(-3deg); -ms-transform: rotate(-3deg); -o-transform: rotate(-3deg); transform: rotate(-3deg);"> sample_labels\)cell_typeTR | ||
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
Here, we added the condition of the donor is into the model design, and we call this model complex model in the following text. (Note: The table above have rendering problems. The table in the previous serction works in rendering, but this figure have some rendering problems. The code used are exactly the same.)
# Fit data to the above model
norm_t_set_fit_comp <- lmFit(norm_t_set, complex_model)
# Apply empircal Bayes to compute differntial expression
norm_t_set_fit_B_comp <- limma::eBayes(norm_t_set_fit_comp, trend = TRUE)
# Pick top hits
# Use Benjamini-hochberg("BH") method for correction
top_norm_t_set_fit_B_comp <- limma::topTable(fit = norm_t_set_fit_B_comp,
coef = ncol(complex_model),
adjust.method = "BH",
number = nrow(norm_t_exp))
hits_comp <- top_norm_t_set_fit_B_comp[order(top_norm_t_set_fit_B_comp$P.Value),]
kable(hits_comp[1:10,],type="html",row.names = TRUE, digits = 50)
| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| IKZF2 | 115.29990 | 73.65007 | 27.56523 | 6.203049e-44 | 8.742577e-40 | 54.17309 |
| RTKN2 | 133.00338 | 79.60551 | 24.57320 | 3.446056e-40 | 2.428436e-36 | 51.01905 |
| IL2RA | 341.59110 | 204.36533 | 21.33512 | 9.991834e-36 | 4.694163e-32 | 46.71107 |
| DENND5A | -90.73612 | 73.55567 | -20.66955 | 9.454604e-35 | 3.331330e-31 | 45.68679 |
| THEMIS | -59.86247 | 42.51102 | -19.75210 | 2.271290e-33 | 6.402313e-30 | 44.18693 |
| TMEM71 | -42.34602 | 39.95185 | -19.64896 | 3.266641e-33 | 7.673339e-30 | 44.01166 |
| FOXP3 | 218.95114 | 120.23002 | 19.45310 | 6.535793e-33 | 1.315935e-29 | 43.67500 |
| LINC00426 | 56.91775 | 47.43085 | 19.04244 | 2.838840e-32 | 5.001327e-29 | 42.95266 |
| IKZF4 | 37.44443 | 28.05649 | 18.84490 | 5.795040e-32 | 9.075033e-29 | 42.59706 |
| STAT4 | -112.05584 | 139.65514 | -18.64377 | 1.204188e-31 | 1.697182e-28 | 42.22948 |
length(which(hits_comp$P.Value < 0.05))
## [1] 7259
length(which(hits_comp$adj.P.Val < 0.05))
## [1] 6444
length(which(hits_comp$P.Value < 0.01))
## [1] 5740
length(which(hits_comp$adj.P.Val < 0.01))
## [1] 5021
length(which(hits_comp$P.Value < 0.001))
## [1] 4339
length(which(hits_comp$adj.P.Val < 0.001))
## [1] 3730
The results from the complex model gives slightly more hits than the simple model. When we use the threshold of 0.05, there are 7259 genes that are found to be differntially expressed, and 6444 genes passed correction. When we use the threshold of 0.01, there are 5740 genes that are found to be differntially expressed, and 5021 genes passed correction. Considering the number of genes found, I increased the threshold to 0.001, and still got hits of 4339 differentially expressed genes, and 3730 of them pass the correction.
Then, we will take a look and see which model performs better by ploting the results we have from the simple and complex model.
# Create simple model dataframe
simple_model_p <- data.frame(HGNC_symbol = rownames(hits),
simple_pvalue = hits$P.Value)
# Create complex model dataframe
complex_model_p <-data.frame(HGNC_symbol = rownames(hits_comp),
comp_pvalue = hits_comp$P.Value)
compare_models <- merge(simple_model_p, complex_model_p,by.x=1,by.y=1)
# Set up colors for models
compare_models$colour <- "black"
compare_models$colour[compare_models$simple_pvalue<0.01] <- "orange"
compare_models$colour[compare_models$comp_pvalue<0.01] <- "blue"
compare_models$colour[compare_models$simple_pvalue<0.01 &
compare_models$comp_pvalue<0.01] <- "red"
#Create device
grDevices::png(filename = file.path(getwd(), "figures",
"model_comparison.png"),
width = 24,
height = 16,
units = "cm",
res = 600)
#Plot
plot(compare_models$simple_pvalue,
compare_models$comp_pvalue,
col = compare_models$colour,
xlab = "Simple model p values",
ylab ="Complex model p values",
main="P Value of Simple model vs Complex model using Limma")
#Create legend
legend("bottomright", legend = c("P.Value ≥ 0.01", "Simple P.Value < 0.01", "Complex P.Value < 0.01", "Both P.Value < 0.01"),
c = c("black", "orange", "blue", "red"), cex=0.7, inset = c(0, 0),
pch = 19)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures", "model_comparison.png"))
Figure 6: P value comparisons of Simple model and Complex model using limma. Each circle in the plor represent a sample. The color of the dot correponds to the legned. Both model used Benjamini-Hochberg method for calculating the false discovery rate. The p value threshold for plotting is p < 0.01.
From the above figure, we can see that all samples in the simple
model that are significant are also in the complex model(We do not see
orange but we do see a small blue region at the bottom left of the
figure). The complex model appears to be more powerful than the simple
model.
To visualize the differntially expressed genes, we will
draw a volcano plot and labels the genes with most significant
differential expression.
# Color all points grey
vol_color = rep('grey', times = nrow(hits_comp))
# Filter down-regulated genes, that met three conditions: fold change > 2, P value and adj.P.Val < 0.001
# Color down-regulated genes light blue #A6CEE3
vol_color[hits_comp$logFC < 0 &
hits_comp$P.Value < 0.001 &
hits_comp$adj.P.Val < 0.001 &
abs(hits_comp$logFC) > 2] <- '#A6CEE3'
# Filter up-regulated genes, that met three conditions: fold change > 2, P value and adj.P.Val < 0.001
# Color down-regulated genes light red #FB9A99
vol_color[hits_comp$logFC > 0 &
hits_comp$adj.P.Val < 0.001 &
hits_comp$P.Value < 0.001 &
abs(hits_comp$logFC) > 2] <- '#FB9A99'
# Output the image
grDevices::png(filename = file.path(getwd(), "figures",
"volcano_plot_for_differntial_expressed_genes.png"),
width = 20,
height = 12,
units = "cm",
res = 600)
# Plot
plot(hits_comp$logFC,
-log(hits_comp$P.Value, base = 10),
col = vol_color,
xlab = "log2 Fold Change",
ylab = "-log10(p value)",
main = "Volcano Plot of Differentially Expressed Genes",
pch = 19
)
# Create ligend
legend("topright",
legend=c("Downregulated genes", "Upregulated genes", "Not differntially expressed"),
fill = c("#A6CEE3", "#FB9A99", "grey"),
cex = 0.9,
pch = 19)
# Label genes with high fold change (>800)
labeled_genes <- which(abs(hits_comp$logFC) > 800 & hits_comp$P.Value < 0.001 & hits_comp$adj.P.Val < 0.001)
text(x = hits_comp$logFC[labeled_genes] ,
y = -log10(hits_comp$P.Value[labeled_genes]),
label = rownames(hits_comp)[labeled_genes],
cex = 0.7, adj = c(1, NA), pos = 3)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures", "volcano_plot_for_differntial_expressed_genes.png"))
Figure 7: Volcano Plot for Differential Expressed Genes. Blue representing downregulated genes, red represents up regulated genes, and grey represents genes that did not pass threshold. Threshold are set as p < 0.001, adjacent p vale < 0.001, and absolute value of log(fold change) > 2. The most up and down regulated genes are labeled with gene symbol in black. Genes with log(fold change) > 800 are labeled in the figure.
As you can see from the above figure. There are many genes with a really high fold change. The labeled genes have a very high fold change(logFC > 800). This threshold is chosen to avoid over crowding the figure. Then, let’s also try visualize the differntially expressed genes using a heatmap.
hits_stringent <- rownames(hits_comp[hits_comp$P.Value < 0.001,])
#Do row normalization
hm_hits_strigent <- t(scale(t(hm[which(rownames(hm) %in% hits_stringent),])))
if(min(hm_hits_strigent) == 0){
# Check values in hm(heatmap matrix generated from my normalized data) contain
# negatives. If the matrix only contain positive values, two color scale will
# be used.
hm_hits_strigent_col = circlize::colorRamp2(
breaks = c( 0, max(hm_hits_strigent)),
colors = c( "white", "red"))
} else {
# If there are both negative and positive values, three color scale will be used.
# Blue represents negative values, white represents 0 and red represents positive values.
hm_hits_strigent_col = circlize::colorRamp2(
breaks = c(min(hm_hits_strigent), 0, max(hm_hits_strigent)),
colors = c("blue", "white", "red"))
}
# use ComplexHeatmap package to create Heatmap
grDevices::png(filename = file.path(getwd(), "figures",
"t_heatmap_for_differntial_expressed_genes.png"),
width = 24,
height = 16,
units = "cm",
res = 600)
tcell_heatmap <- ComplexHeatmap::Heatmap(
# Set parameters for tcell_heatmap
matrix = as.matrix(hm_hits_strigent),
# set columns
col = hm_hits_strigent_col,
# cluster settings
cluster_rows = TRUE,
cluster_columns = TRUE,
# row dendgram shows the gene cluster
show_row_dend = TRUE,
# column dendogram shows the samples cluster.
show_column_dend = TRUE,
show_column_names = TRUE,
column_title = "Normalized Expression of Differential Expressed Genes",
column_title_gp = gpar(fontsize = 16),
column_names_max_height = unit(8, "cm"),
column_names_gp = gpar(fontsize = 6),
column_names_rot = 90,
show_row_names = FALSE,
# Set legend parameters
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "Expression"),
# Set raster image parameters
use_raster = TRUE,
raster_device = "png",
raster_quality = 1,
raster_resize_mat = TRUE,
raster_by_magick = requireNamespace("magick", quietly = TRUE),
raster_magick_filter = NULL
)
draw(tcell_heatmap)
invisible(dev.off())
knitr::include_graphics(file.path(getwd(), "figures",
"t_heatmap_for_differntial_expressed_genes_annotated.png"))
Figure 8: Heatmap for differential expressed genes. The expression level are shown in a color scale from blue to red(from low to high). The differntial expressed genes are analyzed using limma, genes selected here used the threshold of p < 0.001. The red box marks the profound regulatory t cells pertubations in donors, and the yellow box marks the profound conventional t cells pertubations in donors.
As seen in the figures above, the samples cluster nicely by cell
type. The left side of clustered samples represent regulatory T cell
samples from donor samples, whereas the right side of the clustered
samples represent conventional T cell samples from donor samples. These
two clusters have very different gene expressions. The top halves of the
genes are highly expressed in regulatory T cells, but the bottom halves
of the genes are highly expressed in conventional T cells.
If we
take a look at the left and the right halves of the figure separately,
we can see differential cell expressions depending on the condition of
the donor samples.
The red box indicates the differential
expressed genes in regulatory T cells. If we take a look at the sample
labels at the bottom, the samples with high expressions mostly consists
of S1(hospitalized to severe) and R3(recovered) samples. This supports
the published articles discovery that there are profound differences in
gene expression depends on the donor conditions. We will be focused on
this group in the analysis going forward.
The yellow box
indicates differntial expressed genes in conventional T cells. If we
take a look at the sample labels at the bottom, the samples with high
expressions mostly consists of S1(hospitalized to severe) and
R3(recovered) samples. This shows that there is also differences in gene
expression in conventional T cells depends on the donor conditions.
However, there are different groups of genes being expressed
differently.
Here, we will follow the sample process as before, we create a new model considering the condition of donors here.
# Create a matrix from the dataframe retrieved in step1 named T_norm
T_conv <- healthy_samples_r <- hm_rnorm[, grep(colnames(hm_rnorm_df), pattern="\\TR")]
t_conv_m <- as.matrix(T_conv)
rownames(t_conv_m) <- rownames(T_conv)
colnames(t_conv_m) <- colnames(T_conv)
# Define the expression matrix as the assayData
t_conv_set <- Biobase::ExpressionSet(assayData = t_conv_m)
# A more complex model, add condition of donor as another factor
conv_sample_labels <- sample_labels[grep(rownames(sample_labels), pattern="\\TR"),]
tconv_model <- model.matrix(~ conv_sample_labels$condition)
knitr::kable(tconv_model[1:5,],type="html") %>% kableExtra::row_spec(0, angle = -3)
| (Intercept) | conv_sample_labels\(conditionM2 </th> <th style="text-align:right;-webkit-transform: rotate(-3deg); -moz-transform: rotate(-3deg); -ms-transform: rotate(-3deg); -o-transform: rotate(-3deg); transform: rotate(-3deg);"> conv_sample_labels\)conditionR3 | conv_sample_labels$conditionS1 | |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
# Fit data to the above model
tconv_fit <- lmFit(t_conv_set, tconv_model)
# Apply empircal Bayes to compute differntial expression
tconv_fit_B <- limma::eBayes(tconv_fit, trend = TRUE)
# Pick top hits
# Use Benjamini-hochberg("BH") method for correction
tconv_fit_B_top <- limma::topTable(fit = tconv_fit_B,
coef = ncol(tconv_model),
adjust.method = "BH",
number = nrow(t_conv_m))
hits_tconv <- tconv_fit_B_top[order(tconv_fit_B_top$P.Value),]
kable(hits_tconv[1:10,],type="html",row.names = TRUE, digits = 50)
| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| LINC00402 | -2.428362 | 0.35770630 | -5.981572 | 1.726938e-07 | 0.0008347817 | 7.041076 |
| CCNI2 | -2.100908 | -0.18526348 | -5.977400 | 1.753914e-07 | 0.0008347817 | 7.026915 |
| NOG | -2.260541 | -0.06235329 | -5.943324 | 1.990549e-07 | 0.0008347817 | 6.911281 |
| RIMKLB | -1.178608 | -0.67656775 | -5.896384 | 2.369183e-07 | 0.0008347817 | 6.752167 |
| SENP6 | -1.745976 | -0.42735947 | -5.655988 | 5.756277e-07 | 0.0010384777 | 5.940805 |
| OTUB1 | 2.029967 | 0.52883507 | 5.632157 | 6.283267e-07 | 0.0010384777 | 5.860732 |
| N4BP2L2 | -1.965157 | -0.34626537 | -5.628162 | 6.376167e-07 | 0.0010384777 | 5.847316 |
| TXK | -1.441013 | -0.51361064 | -5.618356 | 6.609999e-07 | 0.0010384777 | 5.814394 |
| TPX2 | 1.958518 | 0.58708835 | 5.593130 | 7.251245e-07 | 0.0010384777 | 5.729758 |
| BUB1 | 2.026393 | 0.57070400 | 5.546535 | 8.601628e-07 | 0.0010384777 | 5.573650 |
length(which(hits_tconv$P.Value < 0.05))
## [1] 3034
length(which(hits_tconv$adj.P.Val < 0.05))
## [1] 892
length(which(hits_tconv$P.Value < 0.01))
## [1] 1464
length(which(hits_tconv$adj.P.Val < 0.01))
## [1] 161
These results from only analyzing the conventional T cells. When we use the threshold of 0.05, there are 3034 genes that are found to be differntially expressed, and 892 genes passed correction. When we use the threshold of 0.01, there are 1464 genes that are found to be differntially expressed, and 161 genes passed correction. Overall, this model seems to be less optimal than complex model before. However, considering the complexity of the factors that contributing to the dataset, we will first perform analysis on only the conventional T cell set. Then, we will continue and output the up and down regulated genes to text files for the analysis in the next section.
# Extract up and down regulated data
hits_tconv_stringent <- hits_tconv[which(hits_tconv$P.Value < 0.01 &
hits_tconv$adj.P.Val < 0.01),]
hits_tconv_stringent_up <- hits_tconv[which(hits_tconv$logFC > 0),]
hits_tconv_stringent_down <- hits_tconv[which(hits_tconv$logFC < 0),]
# output files in the result subdirectory
write.table(rownames(hits_tconv_stringent),
file = file.path(getwd(), "results","ranked_gene_list.txt"),
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
write.table(rownames(hits_tconv_stringent_up),
file = file.path(getwd(), "results","upregulated_gene_list.txt"),
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
write.table(rownames(hits_tconv_stringent_down),
file = file.path(getwd(), "results","downregulated_gene_list.txt"),
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
After performing the Thrshold over-representation analysis, the results from only the conventional T cells are very general, consisting of terms from cell cycle to chromosome rearragements. Therefore, we will also output the differntially expressed genes for the entire gene sets, and run the analysis using this set of genes.
hits_out <- hits_comp[which(hits_comp$P.Value < 0.001 & hits_comp$adj.P.Val < 0.001),]
hits_out_up <- hits_comp[which(hits_comp$logFC > 0),]
hits_out_down <- hits_comp[which(hits_comp$logFC < 0),]
write.table(rownames(hits_out),
file = file.path(getwd(), "results","ranked_gene_list_all.txt"),
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
write.table(rownames(hits_out_up),
file = file.path(getwd(), "results","upregulated_gene_list_all.txt"),
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
write.table(rownames(hits_out_down),
file = file.path(getwd(), "results","downregulated_gene_list_all.txt"),
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
As we seen in the differential gene analysis, the data cluster well. However, because this data is consisted of too many factors(patients, donors, cell types). Therefore, I decided to only focused on one comparison on Treg level among different groups. I chose this pair of comparison because this is the area that is was most focus on in the published article.
In this following section, I will use g:GOST function in g:Profiler to fun the analysis. The parameters settings are shown below.
knitr::include_graphics(file.path(getwd(), "figures",
"Procedure1.png"))
Figure 9: Parameters for setting up the g:GOST search(Part 1). Screenshot from g:Profiler web browser.
knitr::include_graphics(file.path(getwd(), "figures",
"Procedure2.png"))
Figure 10: Parameters for setting up the g:GOST search(Part 2). Screenshot from g:Profiler web browser.
The parameter settings are set as in lectures and g:Profiler user manual. The term size in the detailed result page are set as from 5 to 400 to make sure all grab informative terms and exclude terms that are too general or over represented by too few terms. [11] [12] ## Results from conventional T cells set
knitr::include_graphics(file.path(getwd(), "figures",
"conv_ranked_1.png"))
Figure 11: Manhattan plot of conventional T cells enriched terms. Screenshot from g:Profiler web browser.
From the Manhattan plot(P values analysis using annotation resources) above, there is no term passed the significance threshold. We will then looks at the details from the next figures.
knitr::include_graphics(file.path(getwd(), "figures",
"conv_ranked_2.png"))
Figure 12: GO:BP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser.
This figure shows the GO:BP terms from the g:Profiler analysis. There are very general go term results from cell cycle, cell division and chromosome segregation. These terms are very general and not informative for this gene set.
knitr::include_graphics(file.path(getwd(), "figures",
"conv_ranked_3.png"))
Figure 13: KEGG, REAC and WP terms table from g:Profiler analysis. REAC represents Reactome, and WP represents Wiki Pathways. Screenshot from g:Profiler web browser.
This figure shows the KEGG and REAC terms from the g:Profiler analysis. There are very general go term results from glycolysis, proteasome, and mitotic cell cycle. These terms are very general and not informative for this gene set. I also tried for the up regulated and down regulated gene sets, but the results are similar to the results above(See link for upregulated and downregulated gene sets). Although there are terms that pass the threshold in the up and down regulated genes, but the results are not informative. Therefore, I will change my gene set and try to look at the entire differentially expressed gene sets. ## Results from all differentially expressed gene set
knitr::include_graphics(file.path(getwd(), "figures",
"ranked_1.png"))
Figure 14: Manhattan plot of all differentially expressed genes. Screenshot from g:Profiler web browser.
There are 1416 GO:BP terms found, 2 KEGG terms, 5 REAC terms, and 5 WP terms found in ranked gene list. Let’s take a closer look by looking at the detailed values.
knitr::include_graphics(file.path(getwd(), "figures",
"ranked_2.png"))
Figure 15: GO:BP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser.
There are many GO:BP terms related to immune system, cell differentiation, and leukocyte activity, such as “T cell activation”, “Immune response-regulating cell surface receptor signal”, “regulation of leukocyte proliferation”.
knitr::include_graphics(file.path(getwd(), "figures",
"ranked_3.png"))
Figure 16: KEGG, REAC and WP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser. REAC represents Reactome, and WP represents Wiki Pathways.
From these pathway analysis results, we can identify term like “Immunoregulatory interactions” and “Interactions of natural killer cells”. Note that there are also terms related to hemotopoietic activity which can be explained by how the samples are retained from the periphery blood.
knitr::include_graphics(file.path(getwd(), "figures",
"up_1.png"))
Figure 17: Manhattan plot of all upregulated genes. Screenshot from g:Profiler web browser.
There are 473 GO:BP terms found, 2 KEGG terms, 136 REAC terms, and 4 WP terms found in upregulated genes list.
knitr::include_graphics(file.path(getwd(), "figures",
"up_2.png"))
Figure 18: GO:BP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser.
The upregulated expression of GO:BP terms are different from ranked all genes. It contains mostly terms regarding DNA replication, cell division, chromosome seperation. This indicates upregulated cell proliferation.
knitr::include_graphics(file.path(getwd(), "figures",
"up_3.png"))
Figure 19: KEGG terms table from g:Profiler analysis. Screenshot from g:Profiler web browser.
knitr::include_graphics(file.path(getwd(), "figures",
"up_4.png"))
Figure 20: REAC terms table from g:Profiler analysis. Screenshot from g:Profiler web browser. REAC represents Reactome.
knitr::include_graphics(file.path(getwd(), "figures",
"up_5.png"))
Figure 21: WP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser. WP represents Wiki Pathways.
The three figures above have identified similar results as the GO term analysis. All three annotation resources identifid pathway terms includes DNA replication, cell cycle, chromosome organization, and mitotic divisions.
knitr::include_graphics(file.path(getwd(), "figures",
"down_1.png"))
Figure 22: Manhattan plot of all downregulated genes. Screenshot from g:Profiler web browser.
There are 1842 GO:BP terms found, 9 KEGG terms, 14 REAC terms, and 6 WP terms found in downregulated genes list.
knitr::include_graphics(file.path(getwd(), "figures",
"down_2.png"))
Figure 23: GO:BP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser.
The downregulated expression of GO:BP terms are different from ranked all genes. It contains mostly terms regarding call activation, leukocyte activation, and regulation of immune system processes. This indicates downregulated immune response.
knitr::include_graphics(file.path(getwd(), "figures",
"down_3.png"))
Figure 24: KEGG terms table from g:Profiler analysis. Screenshot from g:Profiler web browser.
knitr::include_graphics(file.path(getwd(), "figures",
"down_4.png"))
Figure 25: REAC terms table from g:Profiler analysis. Screenshot from g:Profiler web browser. REAC represents Reactome.
knitr::include_graphics(file.path(getwd(), "figures",
"down_5.png"))
Figure 26: WP terms table from g:Profiler analysis. Screenshot from g:Profiler web browser. WP represents Wiki Pathways.
The three figures above have identified similar results as the GO term analysis. All three annotation resources identifid pathway terms includes B cell receptor signalling pathway, natural killer cell mediated cytotoxicity, immunoregulatory interactions, and B cell activation by SARS-COV2. These indicates a downregulation of immune response.
Question 1: Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?
The simple model which only consider the cell type as a contributing factor. When we use the threshold of 0.05, there are 7212 genes that are found to be differntially expressed, and 6411 genes passed correction. When we use the threshold of 0.01, there are 5691 genes that are found to be differntially expressed, and 4981 genes passed correction. The results from the complex model(consider both cell type and donot condition) gives slightly more hits than the simple model. When we use the threshold of 0.05, there are 7259 genes that are found to be differntially expressed, and 6444 genes passed correction. When we use the threshold of 0.01, there are 5740 genes that are found to be differntially expressed, and 5021 genes passed correction. Considering the number of genes found, I increased the threshold to 0.001, and still got hits of 4339 differentially expressed genes, and 3730 of them pass the correction.
Question 2: Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And why? How many genes passed correction?
I used Benjamini-Hochberg (fdr) method for multiple hypothesis testing which will control the false discovery rate of our results. I chose this method because this is a commonly used method. When we use the threshold of 0.01, there are 4339 differentially expressed genes, and 3730 of them pass the correction.
Question 3: Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.
knitr::include_graphics(file.path(getwd(), "figures", "volcano_plot_for_differntial_expressed_genes.png"))
Figure 7: Volcano Plot for Differential Expressed Genes. Blue representing downregulated genes, red represents up regulated genes, and grey represents genes that did not pass threshold. Threshold are set as p < 0.001, adjacent p vale < 0.001, and absolute value of log(fold change) > 2. The most up and down regulated genes are labeled with gene symbol in black. Genes with log(fold change) > 800 are labeled in the figure.
As you can see from the above figure. There are many genes with a really high fold change. The labeled genes have a very high fold change(logFC > 800). This threshold is chosen to avoid over crowding the figure.
Question 4: Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.
knitr::include_graphics(file.path(getwd(), "figures",
"t_heatmap_for_differntial_expressed_genes_annotated.png"))
Figure 8: Heatmap for differential expressed genes. The expression level are shown in a color scale from blue to red(from low to high). The differntial expressed genes are analyzed using limma, genes selected here used the threshold of p < 0.001. The red box marks the profound regulatory t cells pertubations in donors, and the yellow box marks the profound conventional t cells pertubations in donors.
As seen in the figures above, the samples cluster nicely by cell
type. The left side of clustered samples represent regulatory T cell
samples from donor samples, whereas the right side of the clustered
samples represent conventional T cell samples from donor samples. These
two clusters have very different gene expressions. The top halves of the
genes are highly expressed in regulatory T cells, but the bottom halves
of the genes are highly expressed in conventional T cells.
If we
take a look at the left and the right halves of the figure separately,
we can see differential cell expressions depending on the condition of
the donor samples.
The red box indicates the differential
expressed genes in regulatory T cells. If we take a look at the sample
labels at the bottom, the samples with high expressions mostly consists
of S1(hospitalized to severe) and R3(recovered) samples. This supports
the published articles discovery that there are profound differences in
gene expression depends on the donor conditions. We will be focused on
this group in the analysis going forward.
The yellow box
indicates differntial expressed genes in conventional T cells. If we
take a look at the sample labels at the bottom, the samples with high
expressions mostly consists of S1(hospitalized to severe) and
R3(recovered) samples. This shows that there is also differences in gene
expression in conventional T cells depends on the donor conditions.
However, there are different groups of genes being expressed
differently.
Question 1: Which method did you choose and why?
I chose g:Profiler to analyze the data because this is usually used to deal with a small set of enriched genes. In this case, it fits to use this method. It is also taught in lectures and homework, so it will also be more handy for me to use it. The threshold of 0.05 was chosen to make sure all three gene list can return some results(There is only 7 gene sets identified in the downregulated genes, if further make the threshold smaller, we will not get any results from the downregulated gene set.).
Question 2: What annotation data did you use and why? What version of the annotation are you using?
I used GO:BP in Gene Ontology
databse, KEGG, Reactome and WikiPathway to annotate my data.
hsapien(Human) -version: GRCh38.p13
GO:BP: annotations: BioMart (releases/2022-12-04)
KEGG:KEGG FTP Release 2022-12-26
REAC: annotations: BioMart
(releases/2022-12-28)
WikiPathways: 20221210
Question 3: How many genesets were returned with what thresholds?
Threshold is set at 0.05 for g:Profiler analysis, the terms are set
to from 8-300. Ranked_genes:There are 1416 GO:BP terms found, 2 KEGG terms, 5 REAC terms, and 5 WP
terms found in ranked gene list. Let’s take a closer look by looking at
the detailed values.
Unregulated_genes: There are 473 GO:BP terms found, 2 KEGG terms, 136 REAC
terms, and 4 WP terms found in upregulated genes list.
Downregulated_genes:There are 1842 GO:BP
terms found, 9 KEGG terms, 14 REAC terms, and 6 WP terms found in
downregulated genes list.
Question 4: Run the analysis using the up-regulated set of genes, and down-regulated set of genes separately. How do these results compare to using the whole list?
There are many GO:BP terms related to immune system, cell differentiation, and leukocyte activity, such as “T cell activation”, “Immune response-regulating cell surface receptor signal”, “regulation of leukocyte proliferation”.The upregulated expression of GO:BP terms are different from ranked all genes. It contains mostly terms regarding DNA replication, cell division, chromosome seperation. This indicates upregulated cell proliferation. The downregulated expression of GO:BP terms are different from ranked all genes. It contains mostly terms regarding call activation, leukocyte activation, and regulation of immune system processes. This indicates downregulated immune response.
Question 1: Do the over-representation results support conclusions or mechanism discussed in the original paper?
The over-representation results support the conclusion and the mechanism discussed in the paper. I have identified multiple pathways that are related to the immune system and cell proliferation pathway. The downregulated portion of genes have enriched terms like immune response, which supports the paper’s discovery of very severe T cell expression differences depending on the condition of patients. The upregulated portion of genes have enriched terms like cell proliferation which suggest rapid cell division which can be explained by the virus replication processes.
Question 2: Can you find evidence, i.e publications, to support some of the results that you see. How does this evidence support your results.
There are multiple publications that also find evidence that immune response are downregulated in COVID patients which supports my findings here.@immune However, there is no evidence showing that there will be upregulation of cell proliferation or cell division in COVID patients. This could be explained by how the virus can use cell cycle arrest to promote replication of viral factors.
#Reference